COBS: A Background

COBS is a field trial comparing biofuel systems and represents a unique opprotunity to utilize NGS in order to explore the distrubution of bacterail species associated with carbon cycleing and how they are distrubuted amongst the aggregate fractions in. Why do we care about agreggate fractions and the bacterial distrubution in each? We know that tillage and cropping systems have an effect on the distribution of aggregate sizes. We also know that bacterial species eveness is not consistant accross aggregate fractions. Let’s explore these points further by looking into the distribution of bacteria with the potential for carbohydrate degredation. The results from this study may inform agricultural management decisions based on the desired outcome for carbon cycling. I.E. more or less tillage to drive the aggregate size distribution and therefore the microbial community. Perhaps we can store more carbon in the soil with proper management techniques that facilitate the growth of a desired aggregate class based on the funciton you wish to achieve in the ecosystem.

We can do this by quantifying the amount of bacteria that have genes associated with carbon degrading enzymes of intrest. These enzymes are:

  1. endoglucanase [EC:3.2.1.4]
  2. pullulanase [EC:3.2.1.41]
  3. 1,4-beta-cellobiosidase [EC:3.2.1.91]
  4. beta-glucosidase [EC:3.2.1.21]
  5. beta-D-xylosidase 4 [EC:3.2.1.37]
  6. 6-phospho-beta-glucosidase [EC:3.2.1.86]

First, let’s confirm that we have the required tools for asking questions of the NCBI database. To do this we will use the NCBI API: Entrez Direct and install it via the UNIX command line.

pwd
cd ~
perl -MNet::FTP -e \
  '$ftp = new Net::FTP("ftp.ncbi.nlm.nih.gov", Passive => 1);
   $ftp->login; $ftp->binary;
   $ftp->get("/entrez/entrezdirect/edirect.zip");'
unzip -u -q edirect.zip
rm edirect.zip
export PATH=$PATH:$HOME/edirect
./edirect/setup.sh

Let’s take a look to confirm that it was installed in the home directory:

cd ~
ls
Applications
Box Sync
Desktop
Documents
Downloads
Google Drive
Keys
Library
Movies
Music
Pictures
Public
anaconda
chiAISME.R
edirect

Now that we have the tools, what will we use them for? COBS is a unique dataset that includes many metagenomes from soil aggregates. Many functions in ecosystems are microbially mediated and catalyzed by enzymes. In this study, funded by the DOE, we are interested in carnon cycleing and therefore the enyzmes associated with that funciton. Our collaborator (Kirsten Hofmockel) has identified a few enzymes of interest for our system. The list of enzymes is contained in a text file: ec_numbers.txt Let’s navigate to that directory and take a look:

cd ~/Documents/COBS_CAZY
cat ec_numbers.txt
3.2.1.4
3.2.1.91
3.2.1.21
3.2.1.37
3.2.1.41
3.2.1.86

We now have a list of enzymes of interest, let’s use those NCBI tools to find bacterial sequences associated with those enzymes by creating a python script:

import sys
from Bio import Entrez, SeqIO

Entrez.email = 'jflater@iastate.edu'

# First, find entries that contain the E.C. number
ec_num = sys.argv[1].strip()
#print ec_num
#print 'E.C. '+ ec_num
esearch_handle = Entrez.esearch(db='nucleotide', term='EC '+ec_num)
# When term='E.C. we get zero results, however, if term=EC it works
entries = Entrez.read(esearch_handle)
esearch_handle.close()

# Second, fetch these entries
efetch_handle = Entrez.efetch(db='nucleotide', id=entries['IdList'], rettype='gb', retmode='xml') 
records = Entrez.parse(efetch_handle)

# Now, we go through the records and look for a feature with name 'EC_number'
for record in records:
      for feature in record['GBSeq_feature-table']:
          for subfeature in feature['GBFeature_quals']:
              if (subfeature['GBQualifier_name'] == 'EC_number'   and
                subfeature['GBQualifier_value'] == ec_num):

                    # If we found it, we extract the seq's start and end
                    accession = record['GBSeq_primary-accession']
                    interval = feature['GBFeature_intervals'][0]
                    interval_start = interval['GBInterval_from']
                    interval_end = interval['GBInterval_to']
                    location = feature['GBFeature_location']
                    if location.startswith('complement'):
                        strand = 2
                    else:
                        strand = 1

                    # Now we fetch the nucleotide sequence
                    handle = Entrez.efetch(db="nucleotide", id=accession,
                                           rettype="fasta", strand=strand,
                                           seq_start = interval_start,
                                           seq_stop = interval_end)
                    seq = SeqIO.read(handle, "fasta")

                    print('>GenBank Accession:{}'.format(accession))
                    print(seq.seq)
efetch_handle.close()

Now let’s apply that script to our list of EC #’s by using the command line:

while read line;     
  do python scripts/nucl_from_ec.py $line > "$line".txt;    
  done < ec_numbers.txt
cd Documents/COBS_CAZY
less 3.2.1.37.txt 
>GenBank Accession:CP013238
ATGAATGCAAGATATTTTTATGTTGATAATTATATAATAAGCGCAAAAGATATTAAAGGACCTTGGAGTGAGCCGGTATATATTCATTCATCGGGATTTGATGCATCAATTTTCCATGATGATGATGGAAAGAAATATATAGTGTCTCTTGAGTGGGAAACAAGAGAAGGATATGAAAAACCAGGTGCAATATGTATTGTAGAATATTCTCCAAAAAAGAAAGAAATAATAGGATATCCTCAAAAGATATGGTCAGGAGGAACAGATAGAGGATGTATAGAAGCTCCACATCTAACAAAACGAGGAGATTATTACTATATAATGTGTGCAGAAGGAGGAACAGGATATGGACATAGCGTCACAATGGGAAGAAGTAAAAATATCTTAGGACCATTTGAAAAAGATCCTAAAAATCCAATTGTAACTTCTATTCCAGGGGACTTTAACGAAAGACACGACCCAGACCATTTAAAACCTAAATATTTTAATCCAAAATCAGTACTTCAAAAGTCTGGTCATGGAAGCTACGTAGAAACACCATTAGGTGAAGTATATCTCGTGCATCTTACAGCAAGACCATTTGTACCTGAATTAAGATGTACTTTAGGGAGAGAAACTGCTATTCAAAAAATGAAATGGACAGAAGATGGATGGCTTAGAATGGAAGATGGTTCTAATTTTGCAAAACAATATGTAGATGAAAGTAATATTAAAGAGTATAGAGTAGAGAAATGTCCGGATTTTGATGACTTTGATAAGGGAGAGCTAGGACTTCAATATTATTCTCCAAGAATAATGCCATCTTCTTTTGCAGATGTTAAAGCAAGACCTGGATATGTGAGAATAAGAGGGCAGGAATCAAGATGTTCATTAAATAAAGTAAGCATTCTTGCAAGAAAATTAACATCAGTTTATGCAGTAGTTACTACTAAAATGGAGTTTATTCCTTATGTGCACCAACATAGTGCAGGATTGATAATGTATTATGATAATATGAATTATATTTATCTAAGGAAATATTACAGCGACACATTAAAACATAGTGCAATTTCTGTAATTCATCTTGAAAACGGAGAAAAGACAGAATTTATAAATACAAGAACAAGGGTAGAAGATTCTCCTATTTACTTTAGATTGGTAATTGAAGGAAGAAAATCACATTTTGAATGGTCTTATGATGGTACAAACTATAGTGTAATCGGAAAAATCTTTGATACGACTAAATTTTCAGATGAATATTGTAAATATGGAGAATTTACAGGTACTTTTGTAGGAATAACATGTGCAGATAGAGTGCTTCATAAACATTATGCAGACTTTGATTTCTTTGAATATATAGCTGATGAAGATAAAAATGTAGAGTGA
>GenBank Accession:NZ_LJEQ01000017
ATGATTCAGAATCCTGTCTTTAGAGGTTTTAATCCTGACCCCTGTATATGCCGCCGCGGCGATGATTATTATGTGGCGGTATCATCATTTGAATGGTTTCCCGGACTACCGGTTTATCATTCAAGGGATCTTAAACATTGGCAATTGCTCACCCATGTGCTTACCGATGATAATAACCCAGACTTAAAAAAACTGCCTTCTGCCAAAGGTATTTGGGCCCCAAGCCTGACCTGGTGCGAAGAAGAAAAGCTGTTTTATGTCATTTATGGCGTTATGAATTCAATGAACTCCCGTTATTTTGACGTTGATAATTATCTTATTACCGCCGAGGAGATCACTGGCCCGTGGAGCGCCCCCGTTTATCTCCATTCTGCTGGATTCGATGCCTCAATTTTGCACGACCACGATGGCAGAAAATGGATCGTCTCACTGGAGTGGGAAACTCGCGAAGGCTATGAAAAACCGGGAGCGATCTGCCTGGTGGAATATTCGCCGCAGACGCACAGCGTCATCGGCTACCCGCAGCGCATCTGGCACGGCGGTACCGACCGCGGCTGCATTGAAGCGCCGCATCTGACGAGACGCGGCGACTATTACTACCTGATGGTCGCCGAAGGCGGCACCGGTTACGGGCATTCTGTCACCATGGCCCGCGCCACCGAGGTCGCTGGCCCCTATCAAGGCGATCCGCTCAATCCAATCGTCACATCCTGGCCAGAGAATTTTAACGAACGCAAAGATACAGGCCATCTCAAGCCCCACTATTTCAATCCGGAAACCTACCTGCAAAAGGCCGGGCACGGTAGCTATGTCGAGACCCCAACGGGTGAAGTCTGGCTAACCCACCTTTGCAGTCGTCCGTTTCGCCAGGAGCTACGCTGCCCGCTGGGTCGCGAAACGGCGATCCAAAGGATGGAATGGAGCGAGGATGGCTGGCTGCGCCTGGCCGCTGGGGGTCATCTCGCACAGCATCAGGTAGAAGGATCCCGTCTGCCGCCGCACCCGTTTCCACCAAAAGCGGACCTTGATGATTTTGATGAGCCAAGGGTGGATAACGCGTTTTATGCGCCGAGGATCCATTTCCAGCGTTTTACCTGCCTGACGCGTAAAGCAGGCTATCTTGCTCTACGCGGTCAGGAGTCGCTGAGCTCGCTCAATAAAGTCAGTCTGCTGGCGAAAAAGCTGACCTCGGTATACGCCAACATCAGTACGAAGATGGATTTTAACCCGGAGATTTATCAGCACAGCGCTGGACTGGTGCTCTATTACGACAACATGAACTATCTGTTTTTACATAAAACCTGGGATGAAACCTCTGGCGCGGCGCAACTGGCTATTATCTACATGGATAACGGTGAGCGTCACGACGATCCACAGAAAATTCGCCTTGCCGAGGGAGAAATCTATCTCGCGATTGCTATTAATGGCCGAGAGGTACAGTGTTCATGGTCTGTCGACGGTGAGCACTATCACCCCATCGGAGCGGTCTACGACACATCGCATTTCTCCGATGAATACAGCCGCTACGGCGAGTTTACCGGGGCGTTTGTCGGCATGGCCTGCGTGGATAGTATGCTGCATCGCAAAGAGGCGCTGTTTGATTTCTTCTGCTACCGGGCGGATGAAGACGCGATAATCGAGTAA
>GenBank Accession:NZ_LUTZ01000010
ATGTCCCTTATCCAAAACCCTATATTACGTGGTTTTAATGCCGACCCCAGCATTATCCGTGTTGAGGACACCTACTATATCGCCAACTCGACATTTGAGTGGTTTCCTGGCGTTCGTTTACATGAATCAAAAGATCTGAAGCACTGGAATCTTCTGCCGTCGCCATTGTCAACCACTACCCTCTTAGATATGAAGGGGAATCCGTCTTCAGGCGGTATTTGGGCTCCGGCGCTCTCTTATGCGGATGGTAAATTCTGGTTAGTGTATACCGATGTGAAAGTCACCGAAGGTGCCTTTAAAGACATGACAAACTACCTGACCACCGCAACAGATATTCGCGGCCCGTGGAGTGCGCCGATAAAACTGAATGGCGTGGGTTTCGATGCTTCACTTTTCCATGACGATGATGGCCGTAAATATATTGTGCAACAGACGTGGGATCATCGGGAGTACCACCATCCTTTTGATGGGATTACCTTAACAGAGCTTGATACAACAACTTTAAAATTAATGCCGGAAACCGCACGGACCATCTATCGCGGTACCGCAGTAGCGCTCGTTGAGGGGCCGCATCTCTATAAACTGAACGGTTATTACTATCTCTTTGCCGCTCAAGGGGGGACGGTATTTACTCATCAGGAGGTGGTGGCCCGTTCCACCACTTTAAATGCCGACAGCTTTGAAACCGAGCCTGGAGAAGTATTCTTAACTAACGTTGATACCCCTGACAGCTATATCCAGAAGCAAGGACACGGCGCTTTGGTTTCGACGCCCGAAGGCGAATGGTATTATGCTTCGCTGTGCGCTCGTCCGTGGAATCGTCCCGGAGAATCAATCTATGATCCTCGCGGCTGGTCTACCCTTGGCCGGGAAACGGCCATCCAAAAAGTATATTGGGATGAAGATGGCTGGCCGCGTATTGAAGGCGGTCACGGCGGTAAAACGTTTGTCGAGGGCCCGAAAGACGCCATTTATACCGAAAGCGCGAAAGATAATAGCCAGCACGATGAATTTACTACGCCAGCGCTTAATCTTAACTGGAATACCCTGCGGGTGCCTTTTAGCGAAAAAATGGGTACCACAGGCCATGGAAAGCTCACCTTAATCGGCCAGGGTTCATTAGCGAATACCCATGATCTGTCGTTGATTGCCCGCCGCTGGCAAGCCTTTTATTTTAATGCTGAAGTTAAAGTCGCCTTTAATCCTTTCAGCTACCAACAAATGGCCGGATTAACCAATTACTACAACGACCGTCACTGGAGCTTTGCTTTTGTTACCTGGAATGAAATTAACGGCAGAGTCATCGAGGTCGCCGAGAACAATCGTGGAAAATATACTTCGTACCTGAAAGATAACGCGATTAAGATTCCTGACGACGTCGAATACGTGTGGTTACGGACGAAAGTCAGGAAGCAGACCTATAGCTATGAGTATAGTTTTGACGGCGTCGATTTTATTGAAATTCCGGTAGTTTTAGATGCCGCCGTCCTTTCCGATGACTATGTTCTGCAAAGCTATGGCGGGTTCTTTACCGGTGCATTTGTAGGTCTTGCGGCAGTAGACTACTCAGGCTACGGCGCCAGCGCTGAATTTTATCATTTCGATTATCAAGAGTTTGGCGACTCGTTAATTGGCACGGATGTTTATAGTTGGGAGGCTGGCGAGCTACGTGCTGATTAG
>GenBank Accession:NZ_CAKZ01000125
ATGACTATCTATAAGGACCCAACCCGTCCGGTGGCTGAACGCGTCGCCGATTTGCTCGCCCGCATGACGCCGGAGGAGAAATTCGCCCAGATGCACGCTTACTGGCTGGTGCTGTCGCCGGAGGGCGATCACCGGGAGCGAACCGATTTGAGCGATGAGTTTTCCGGCGCGACCCAGCAGGCGGCGCTGACCGAACGTCTGAAACGCGGCGCGGGGCAGATAACCCGCCCGCTCGGCACCCATATCGTCGCGCCGCGGGAGGGCGTGCGCGCCGCTAACCGTCTGCAGAAAATGCTGGTGGAAGAGACGCGGCTCGGCATTCCCGCCATGTTCCATGAAGAGTGCCTGGTGGGGCTGCTATGCAAAGACGCGACGCTGTTTCCGTCGTCGCTCAACTACGGTTCCACCTGGGATCCGCAGCTGGTGGAGCAGGCCGCGCAGGCTATCGGGCGTGAAGCGCGCGCCGTCGGCTGTCATCAGGGCCTCGCGCCGGTGCTCGATGTGTCGCGCGACGTGCGCTGGGGCCGCACGGAAGAAACCTTTGGCGAAGATCCGTGGCTGGTGGGCGTGATGGCGACCCGCTACGTGAAAGGGCTACAGGGGCCGCAGCGGGATCTGCTCGCGACCCTCAAGCACTACGTCGGCCACTCGTTCAGCGAAGGCGCGCGCAACCATGCGCCGGTGCACCTCGGCTTTTGCGAACTCAACGACACCTTCCTGCTGCCGTTTGAAATGGCGGTGAAGCTGGCGCACGCGGGCTCGGTGATGCCTGCGTATCACGATATCGATAACGTGCCGACCCACGCCGATGATTTCCTCCTCACACAAGTATTGCGCGAACAATGGGGCTTTGACGGCATTATCGTCGCCGACTACGGTGGTGTAAGCCTGCTGCATCAGCACCACGGCGTGGCGCAGGACGCGGCGCACTCGGCGGCGCTGGCCTTTAATGCGGGGCTGGATATCGAACTGCCGAAAGACGACTGCGCGCGCCATCTGGCGCAGGCGCTGGCGCGCGGGCTTATCACGATGGAAAAAGTGGATAAAATTGTGGCGCGCGTGCTGGGCGAAAAGTTCCGTCTCGGACTGTTTGAACAGCCGTATGCCGATGAGAACGCCATTACGCTGCAAAGCGACGAGACGCGCCGCATCGCCCGCGAGGTGGCGGCGCGTTCCCTGACGCTGCTTGAGAATAACGGTGTGCTGCCGCTTCAGGGTACGCCGCGCGTGGCGGTAGTCGGGCCGACGGCGGACGATCCGCTGGCGCTGCTGAGCGGCTACAGCTTCCCGGTGCATCTCATCATCAGCGATATGCTGGAGCAGACAAGCCAGGTGACGACGCCGCTTGCCGCGCTGCGCGAACAGCTCGGTGGCGCGCTCGCAGGCTATGCCAAAGGCTGTCATATCATTGAGAAACGCATGGCGGGCGCGCCGGTGTTCCCCGGCGACAGCGGCGAAAAACCGATGCAGCAGTCGCCGGTCTCAGACGATGTGTCGCTCATCCCGGACGCAGTGGCGCTCGCCGGACAAAGCGACGTGGTGCTGGCGTTTGTCGGCGATCTCTCCGGGCTGTTCCAGAGCGGCACCGTGGGCGAAGGCTCTGACACCGACAGCCTGCAACTGCCGGGCGTGCAGCAACAGCTGCTGGAGGCGCTGGTGGAAACGGGTAAGCCGGTCGTGGTCGTGATGACCGGCGGTCGCCCTTATCACCTGGGCGGGCTGGAGTCGCGCGTCGCGGCGTGGGTGATGGCCTGGGCGCCGGGGCAGGAGGGCGGACACGCGATTGCGGATCTGCTGACCGGCAAAGCGGAACCGCAGGGCCGGTTAGTGGTGTCGGTGCCGAAAAGCGCAGGCGCGATGCCGTACTACTACAACCATAAGCTCAAAAGCGGCGGCACGCCTTACGCGTTTCACTTCGGCTCACGCTACCCGTTCGGCTACGGCAAAACCTGGACCGAATTCCGCTATGGCGCCCTGGACATCGCCCAGGCGCGCGTGCCGATGGCGGGCGAGGTGGAGGTATCGGTTACCGTCACCAACAGCGGCGCGCAGGCGGGCAGCGAGGTGGTACAGCTGTATGTGCACGATAAAGTAGCCTCGATGGTCCGCCCGGTGCAGGAGCTGAAAGCCTTCGGGCGCGTTACGCTTGCCCCTGGCGCCAGCGCGCGCGTTACGTTCCGCGTGCCGGTCGATATGCTCAGTTTCACGCGTCGTGATGGCGCACGCATTGTCGAGCCTGGCGAGTTTGACATTCGCGTTGGGGCCAATAGCGGCGATATTCGCAGTCGCGGCACCGTTATTGTGGAAGGCGAAACCCAGGTGCTGGGAAATAGCTGGCGGATGCTGAGTGAGTGTCATGTTGAGCAATAA
>GenBank Accession:NZ_AFMO01000196
GTGCCGAAAAGCGCGGGCGCGATGCCTTACTATTACAATCACAAGCTCAAAAGCGGCGGCACGCCTTACGCGTTTCACTTCGGCTCGCGCTACCCGTTCGGCTACGGCAAAACCTGGACAATGTTCCGCTACGGCGAGCTGGAAATCGCGCAGTCGCGCGTGCCGATGAATGGCGATGTTGAGGCATCTGTCACCGTCACTAACAGCGGTACGCAGCCGGGCAGCGAGGTGGTGCAGCTCTATGTGCGCGATAACGTGGCCTCGCTGGTGCGGCCGGTGCAGGAGCTGAAAGCCTTCGGGCGGGTCTCGCTTGCGCCTGGCGAGAGCGCGCGCGTTACGTTCCGCATCCCTGTCGATATGCTTAATTTCACGAGCCGCGATGGTGTTCGTATTGTCGAGCCTGGCGAGTTTGAAATCCGCGTCGGCGCTAACAGTGGCGATATCCGCAGCCGCGGCACCGTGACCGTGGAAGGGGAAACGCACGTGCTCGGAAATAACTGGCGTATGCTCAGTGAGTGTTGTGTTGAGCAATAA

To use hpcc: Lot’s of disc space, temporary unlimited (3 months)

ssh *****@hpcc.msu.edu
password:
#We are now at the gateway, we must connect to a development node
ssh dv-intel16
#Now we move to a scratch folder
cd /mnt/scratch/*****

Now that we are in a space where we can download data and perform work, let’s download the NCBI db

#Make a directory for this project
mkdir COBS_CAZY
cd COBS_CAZY/
#now we create a program to downlaod the db
emacs download.refseq.qsub

now lets download db to that ftp://ftp.ncbi.nlm.nih.gov refseq_protien.10.tar.gz.md5 and *.gz is what we want to put on HPCC

pwd
cd Documents/COBS_CAZY/scripts
cat download.refseq.qsub
/Users/jaredflater
#!/bin/bash ~login

### define resources needed:
### walltime - how long you expec the job to run
#PBS -l walltime=40:00:00

### nodes:ppn - how many nodes & cores per node (ppn) that you require
#PBS -l nodes=1:ppn=1

### mem: amount of memory that the job will need
#PBS -l mem=4gb

#PBS -q main
#PBS -M youremail@email.com
#PBS -m your department

### you can give your job a name for easier identification
#PBS -N yourname

### load module

### Change directory
cd /mnt/scratch/username/COBS_CAZY

### call
wget ftp://ftp.cbi.nlm.nih.gov/blast/db/refseq_protein.* -P resfeq_protein

In HPC: move out of dev intel…everyone is using it. There are many more computers to choose from begind that…called nodes We want to work in a node, we need to submit a job by using qsub

[*****@dev-intel16 ~]$ cd /mnt/scratch/***** [*****@dev-intel16 *****]$ ls [*****@dev-intel16 *****]$ mkdir COBS_CAZY [*****@dev-intel16 *****]$ cd COBS_CAZY/ [*****@dev-intel16 COBS_CAZY]$ emacs download.refseq.qsub

[1]+ Stopped emacs download.refseq.qsub [*****@dev-intel16 COBS_CAZY]$ ls download.refseq.qsub download.refseq.qsub~ [*****@dev-intel16 COBS_CAZY]$ rm *~ [*****@dev-intel16 COBS_CAZY]$ ls download.refseq.qsub [*****@dev-intel16 COBS_CAZY]$ qsub download.refseq.qsub 37523906.mgr-04.i [*****@dev-intel16 COBS_CAZY]$ ls download.refseq.qsub [*****@dev-intel16 COBS_CAZY]$

.o is output, .e is error

S = Q in line S = R running S = C complete

Check if everything downloaded correctly by using md5(it tells us if downloaded correctly) Once all is downlaoded, unzip

Then ready to run blastp (files in fasta) clone the github repository # note change spellin on protein for emacs

module load git (to upload repository) git clone url once repository is uploaded then blastp

submit blastp: qsub blastp.qsub

2016_12_02:09:06 AM, Noticed that we did not qsub blastp for EC *.86, let’s make that script and submit the job.

ssh *****@hpcc.msu.edu
password:
ssh dev-intel16
cd /mnt/scratch/*****/COBS_CAZY
ls
---
title: "COBS_CAZY"
output: html_notebook
root: ../../../
---
![](images/COBSPIC.png)
------
#COBS: A Background

COBS is a field trial comparing biofuel systems and represents a unique opprotunity to utilize NGS in order to explore the distrubution of bacterail species associated with carbon cycleing and how they are distrubuted amongst the aggregate fractions in. Why do we care about agreggate fractions and the bacterial distrubution in each? We know that tillage and cropping systems have an effect on the distribution of aggregate sizes. We also know that bacterial species eveness is not consistant accross aggregate fractions. Let's explore these points further by looking into the distribution of bacteria with the potential for carbohydrate degredation. The results from this study may inform agricultural management decisions based on the desired outcome for carbon cycling. I.E. more or less tillage to drive the aggregate size distribution and therefore the microbial community. Perhaps we can store more carbon in the soil with proper management techniques that facilitate the growth of a desired aggregate class based on the funciton you wish to achieve in the ecosystem. 

We can do this by quantifying the amount of bacteria that have genes associated with carbon degrading enzymes of intrest. These enzymes are: 

  1. endoglucanase [EC:3.2.1.4]
  2. pullulanase [EC:3.2.1.41]
  3. 1,4-beta-cellobiosidase [EC:3.2.1.91]
  4. beta-glucosidase [EC:3.2.1.21]
  5. beta-D-xylosidase 4 [EC:3.2.1.37]
  6. 6-phospho-beta-glucosidase [EC:3.2.1.86]

![](images/AGGdist.png)
First, let's confirm that we have the required tools for asking questions of the NCBI database. To do this we will use the NCBI API: Entrez Direct and install it via the UNIX command line. 
```{bash, eval=FALSE}
pwd
cd ~
perl -MNet::FTP -e \
  '$ftp = new Net::FTP("ftp.ncbi.nlm.nih.gov", Passive => 1);
   $ftp->login; $ftp->binary;
   $ftp->get("/entrez/entrezdirect/edirect.zip");'
unzip -u -q edirect.zip
rm edirect.zip
export PATH=$PATH:$HOME/edirect
./edirect/setup.sh
```
Let's take a look to confirm that it was installed in the home directory:
```{bash}
cd ~
ls
```
Now that we have the tools, what will we use them for? COBS is a unique dataset that includes many metagenomes from soil aggregates. Many functions in ecosystems are microbially mediated and catalyzed by enzymes. In this study, funded by the DOE, we are interested in carnon cycleing and therefore the enyzmes associated with that funciton. Our collaborator (Kirsten Hofmockel) has identified a few enzymes of interest for our system. The list of enzymes is contained in a text file: ec_numbers.txt
Let's navigate to that directory and take a look:
```{bash}
cd ~/Documents/COBS_CAZY
cat ec_numbers.txt
```
We now have a list of enzymes of interest, let's use those NCBI tools to find bacterial sequences associated with those enzymes by creating a python script:  
```{python, eval=FALSE}
import sys
from Bio import Entrez, SeqIO

Entrez.email = 'jflater@iastate.edu'

# First, find entries that contain the E.C. number
ec_num = sys.argv[1].strip()
#print ec_num
#print 'E.C. '+ ec_num
esearch_handle = Entrez.esearch(db='nucleotide', term='EC '+ec_num)
# When term='E.C. we get zero results, however, if term=EC it works
entries = Entrez.read(esearch_handle)
esearch_handle.close()

# Second, fetch these entries
efetch_handle = Entrez.efetch(db='nucleotide', id=entries['IdList'], rettype='gb', retmode='xml') 
records = Entrez.parse(efetch_handle)

# Now, we go through the records and look for a feature with name 'EC_number'
for record in records:
      for feature in record['GBSeq_feature-table']:
          for subfeature in feature['GBFeature_quals']:
              if (subfeature['GBQualifier_name'] == 'EC_number'   and
                subfeature['GBQualifier_value'] == ec_num):

                    # If we found it, we extract the seq's start and end
                    accession = record['GBSeq_primary-accession']
                    interval = feature['GBFeature_intervals'][0]
                    interval_start = interval['GBInterval_from']
                    interval_end = interval['GBInterval_to']
                    location = feature['GBFeature_location']
                    if location.startswith('complement'):
                        strand = 2
                    else:
                        strand = 1

                    # Now we fetch the nucleotide sequence
                    handle = Entrez.efetch(db="nucleotide", id=accession,
                                           rettype="fasta", strand=strand,
                                           seq_start = interval_start,
                                           seq_stop = interval_end)
                    seq = SeqIO.read(handle, "fasta")

                    print('>GenBank Accession:{}'.format(accession))
                    print(seq.seq)
efetch_handle.close()

```
Now let's apply that script to our list of EC #'s by using the command line:
```{bash, eval=FALSE}
while read line;     
  do python scripts/nucl_from_ec.py $line > "$line".txt;    
  done < ec_numbers.txt
```

```{bash}
cd Documents/COBS_CAZY
less 3.2.1.37.txt 
```

To use hpcc:
Lot's of disc space, temporary unlimited (3 months)
```{bash}
ssh *****@hpcc.msu.edu
password:
#We are now at the gateway, we must connect to a development node
ssh dv-intel16
#Now we move to a scratch folder
cd /mnt/scratch/*****
```

Now that we are in a space where we can download data and perform work, let's download the NCBI db
```{bash}
#Make a directory for this project
mkdir COBS_CAZY
cd COBS_CAZY/
#now we create a program to downlaod the db
emacs download.refseq.qsub
```

now lets download db to that  ftp://ftp.ncbi.nlm.nih.gov
refseq_protien.10.tar.gz.md5 and *.gz is what we want to put on HPCC
```{bash}
pwd
cd Documents/COBS_CAZY/scripts
cat download.refseq.qsub
```

In HPC: move out of dev intel...everyone is using it. There are many more computers to choose from begind that...called nodes
We want to work in a node, we need to submit a job by using qsub
```{bash, eval=FALSE}

```

[*****@dev-intel16 ~]$ cd /mnt/scratch/*****
[*****@dev-intel16 *****]$ ls
[*****@dev-intel16 *****]$ mkdir COBS_CAZY
[*****@dev-intel16 *****]$ cd COBS_CAZY/
[*****@dev-intel16 COBS_CAZY]$ emacs download.refseq.qsub

[1]+  Stopped                 emacs download.refseq.qsub
[*****@dev-intel16 COBS_CAZY]$ ls
download.refseq.qsub  download.refseq.qsub~
[*****@dev-intel16 COBS_CAZY]$ rm *~
[*****@dev-intel16 COBS_CAZY]$ ls
download.refseq.qsub
[*****@dev-intel16 COBS_CAZY]$ qsub download.refseq.qsub
37523906.mgr-04.i
[*****@dev-intel16 COBS_CAZY]$ ls
download.refseq.qsub
[*****@dev-intel16 COBS_CAZY]$

.o is output, .e is error

S = Q in line
S = R running
S = C complete

Check if everything downloaded correctly by using md5(it tells us if downloaded correctly)
Once all is downlaoded, unzip



Then ready to run blastp (files in fasta) clone the github repository
# note change spellin on protein for emacs

module load git (to upload repository)
git clone url
once repository is uploaded then blastp

submit blastp: qsub blastp.qsub

#2016_12_02:09:06 AM, Noticed that we did not qsub blastp for EC *.86, let's make that script and submit the job. 
```{bash, eval=FALSE}
ssh *****@hpcc.msu.edu
password:
ssh dev-intel16
cd /mnt/scratch/*****/COBS_CAZY
ls

```

